NSF PAR Search | NSF Public Access Repository


Combining self-supervision and privileged information for representation learning from tabular data

https://doi.org/10.1007/s10115-025-02418-1

Yang, Haoyu; Steinbach, Michael; Melton, Genevieve; Kumar, Vipin; Simon, Gyorgy (April 2025, Knowledge and Information Systems)

When building predictive models for real-world applications, much data is discarded because conventional learning algorithms cannot use it, even though such data could be very informative. This paper focuses on representation learning using two types of additional data: privileged information (PI) and unlabeled data. PI refers to data that is available during training but not at test time. Existing methods transfer the knowledge embedded in PI via supervised mechanisms, making them unable to use unlabeled data. In contrast, self-supervised learning methods can use unlabeled data but cannot learn from PI. While these techniques appear complementary, we demonstrate that combining them is non-trivial. This paper introduces the privileged information regularized (PIReg) self-supervised learning framework, which uses both PI and unlabeled data to learn better representations.
Predicting diabetes clinical outcomes using longitudinal risk factor trajectories

https://doi.org/10.1186/s12911-019-1009-3

Simon, Gyorgy J.; Peterson, Kevin A.; Castro, M. Regina; Steinbach, Michael S.; Kumar, Vipin; Caraballo, Pedro J. (December 2020, BMC Medical Informatics and Decision Making)

A new representation of disease conditions and treatment pathways accurately predicts mortality and chronic diseases

Ngufor, Che; Caraballo, Pedro; Byrne, Thomas J.; Chen, David; Shah, Nilay D.; Steinbach, Michael; Simon, Gyorgy (November 2019, AMIA 2019 Annual Symposium)

In this study, we introduce a novel representation of patient data called Disease Severity Hierarchy (DSH) that explores specific diseases and their known treatment pathways in a nested fashion to create subpopulations in a clinically meaningful way. As the DSH tree is traversed from the root towards the leaves, we encounter subpopulations that share increasingly rich clinical detail, such as similar disease severity, illness trajectories, and time to event, which is discriminative and suitable for learning risk stratification models. The proposed DSH risk scores predict the age at which a patient may be at risk of dying or developing MCE significantly better than a traditional representation of disease conditions. DSH uses known relationships among various entities in EHR data to capture disease severity in a natural way and has the additional benefit of being expressive and interpretable. This novel patient representation can help support critical decision making and the development of smart EBP guidelines, and can enhance health care and disease management by helping to identify and reduce disease burden among high-risk patients.
Evaluating the Impact of Data Representation on EHR-Based Analytic Tasks

Oh, Wonsuk; Steinbach, Michael S.; Castro, M. Regina; Peterson, Kevin A.; Kumar, Vipin; Caraballo, Pedro J.; Simon, Gyorgy J. (August 2019, Medinfo 2019)

Different analytic techniques operate optimally with different types of data. As the use of EHR-based analytics expands to newer tasks, data will have to be transformed into different representations so that the tasks can be solved optimally. We classified representations into broad categories based on their characteristics and proposed a new knowledge-driven representation for clinical data mining and trajectory mining, called Severity Encoding Variables (SEVs). Additionally, we studied which characteristics make representations most suitable for particular clinical analytics tasks, including trajectory mining. Our evaluation shows that, for regression, most data representations performed similarly, with SEV achieving a slight (albeit statistically significant) advantage. For patients at high risk of diabetes, SEV outperformed the competing representation by a relative 20%. For association mining, SEV achieved the highest performance. Its ability to constrain the search space of patterns through clinical knowledge was key to its success.